A Kernel Independence Test for Geographical Language Variation
Abstract
Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real datasets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a dataset of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
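To make the idea of a kernel test computed from individual geotagged observations concrete, here is a minimal sketch. The abstract does not name the statistic, so the sketch assumes a Hilbert-Schmidt independence criterion (HSIC) with Gaussian kernels and a permutation null. The function names, the kernel choice, and the median-distance bandwidth heuristic are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def gaussian_gram(x, bandwidth=None):
    """Gaussian (RBF) Gram matrix over the rows of x.

    Assumption: when no bandwidth is given, use the median of the
    nonzero pairwise distances (a common heuristic).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        positive = sq_dists[sq_dists > 0]
        bandwidth = np.sqrt(np.median(positive)) if positive.size else 1.0
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2


def hsic_permutation_test(locations, values, n_permutations=500, seed=0):
    """Test independence of location and a linguistic variable.

    locations: (n, 2) array of coordinates, one row per observation.
    values:    (n,) or (n, d) array holding the linguistic variable
               (a frequency, a 0/1 indicator, ...).
    Returns (observed statistic, permutation p-value).
    """
    rng = np.random.default_rng(seed)
    K = gaussian_gram(locations)
    L = gaussian_gram(values)
    observed = hsic(K, L)
    n = K.shape[0]
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling the values breaks any link to location (the null hypothesis).
        perm = rng.permutation(n)
        if hsic(K, L[np.ix_(perm, perm)]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value
```

Every observation enters through the pairwise Gram matrices, so no predefined geographical bins are needed. The same code accepts either frequencies or boolean indicators as `values`, although a kernel tailored to the data type may be a better choice in practice.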
Similar resources
A Kernel Statistical Test of Independence
Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m...
The Application of Geographical Information System in Explaining Spatial Distribution of Low Birth Weight; a Case Study in North of Iran
Background: Geographical Information System is a new tool in environmental epidemiology that makes the opportunity of visualization and analysis of spatial data. The aim of this study was to determine the geographic variation of low birth weight using geographic information system in order to evaluate the efficacy of primary health care and health information system. Methods: Low birth weight r...
Self-Discrepancy Conditional Independence Test
Tests of conditional independence (CI) of random variables play an important role in machine learning and causal inference. Of particular interest are kernel-based CI tests which allow us to test for independence among random variables with complex distribution functions. The efficacy of a CI test is measured in terms of its power and its calibratedness. We show that the Kernel CI Permutation T...
Independence Tests based on the Conditional Expectation
In this paper we propose a new procedure for testing independence of random variables, which is based on the conditional expectation. As it is well known, the behaviour of the conditional expectation may determine a necessary condition for stochastic independence, that is, the so called mean independence. We provide a necessary and sufficient condition for independence in terms of conditional e...
A Wild Bootstrap for Degenerate Kernel Tests
A wild bootstrap method for nonparametric hypothesis tests based on kernel distribution embeddings is proposed. This bootstrap method is used to construct provably consistent tests that apply to random processes, for which the naive permutation-based bootstrap fails. It applies to a large group of kernel tests based on V-statistics, which are degenerate under the null hypothesis, and nondegener...
Journal:
- Computational Linguistics
Volume 43
Publication date: 2017